Web Content Extraction - a Meta-Analysis of its Past and Thoughts on its Future
In this paper, we present a meta-analysis of several Web content extraction
algorithms, and make recommendations for the future of content extraction on
the Web. First, we find that nearly all Web content extractors do not consider
a very large, and growing, portion of modern Web pages. Second, it is well
understood that wrapper induction extractors tend to break as the Web changes;
heuristic/feature engineering extractors were thought to be immune to a Web
site's evolution, but we find that this is not the case: heuristic content
extractor performance also tends to degrade over time due to the evolution of
Web site forms and practices. We conclude with recommendations for future work
that address these and other findings.
Comment: Accepted for publication in SIGKDD Explorations.
Preface of the 31st Italian Symposium on Advanced Database Systems
This volume contains the proceedings of the 31st Italian Symposium on Advanced Database Systems (SEBD - Sistemi Evoluti per Basi di Dati), held in Galzignano Terme (Padua, Italy) from 2 to 5 July 2023.
Wrapper Inference for Ambiguous Web Pages
Several studies have concentrated on the generation of wrappers for web data sources. As
wrappers can be easily described as grammars, the grammatical inference heritage could play a
significant role in this research field. Recent results have identified a new subclass of regular
languages, called prefix mark-up languages, that nicely abstract the structures usually found in
HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a
PTIME unsupervised learning algorithm has been previously developed. Unfortunately, many
real-life web pages do not fall into this class of languages. In this article we analyze the roots of
the problem and propose a technique to transform pages in order to bring them into the class
of prefix mark-up languages. In this way, we obtain a practical solution without renouncing
the formal background defined within the grammatical inference framework. We report on
experiments that we have conducted on real-life web pages to evaluate the approach; the results
of this activity demonstrate the effectiveness of the presented techniques.
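The ambiguity at issue can be made concrete with a toy sketch (hypothetical code, not the article's algorithm): if a wrapper is viewed as a grammar over a page's token stream, a page is ambiguous with respect to that wrapper when its tokens admit more than one parse, which is exactly what defeats naive unsupervised inference.

```python
def count_parses(seq, tokens, ti=0, cont=None):
    """Count distinct parses of tokens[ti:] against seq, a list of
    literal tokens, ('?', sub) optional groups, or ('+', sub) repetitions."""
    if cont is None:
        cont = lambda t: 1 if t == len(tokens) else 0
    if not seq:
        return cont(ti)
    head, rest = seq[0], seq[1:]
    if isinstance(head, str):                       # literal: must match exactly
        if ti < len(tokens) and tokens[ti] == head:
            return count_parses(rest, tokens, ti + 1, cont)
        return 0
    kind, sub = head
    total = 0
    if kind == '?':                                 # optional: skip, or take once
        total += count_parses(rest, tokens, ti, cont)
        total += count_parses(sub, tokens, ti,
                              lambda t: count_parses(rest, tokens, t, cont))
    elif kind == '+':                               # one-or-more repetitions
        def after(t):
            return (count_parses(rest, tokens, t, cont)
                    + count_parses(sub, tokens, t, after))
        total += count_parses(sub, tokens, ti, after)
    return total

item = ['<li>', 'TEXT', '</li>']
page = item * 2                                     # a page with two records

# (item)+ followed by (item)? parses this page in two ways -- ambiguous:
assert count_parses([('+', item), ('?', item)], page) == 2
# (item)+ alone parses the same page in exactly one way:
assert count_parses([('+', item)], page) == 1
```

The prefix mark-up restriction studied in the article rules out grammars like the first one, which is why transforming pages into that class restores unambiguous, learnable structure.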
The RoadRunner Project: Towards Automatic Extraction of Web Data
Introduction. ROADRUNNER is a research project that aims at developing solutions for automatically extracting data from large HTML data sources. The targets of our research are data-intensive Web sites, i.e., HTML-based sites that publish large amounts of data in a fairly complex structure. Ideally, we see the data extraction process for a data-intensive Web site as a black box that takes as input the URL of an entry point to the site (e.g., the home page) and returns as output the data extracted from the site's HTML pages in a structured, database-like format. This paper describes the top-level software architecture of the ROADRUNNER system, which has been specifically designed to automate the data extraction process. Several components of the system have already been implemented, and preliminary experiments show the feasibility of our ideas. Data-intensive Web sites usually share a number
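The core intuition behind this style of extraction can be illustrated with a greatly simplified sketch (hypothetical code, not the ROADRUNNER implementation): align the token streams of two sample pages generated from the same template, keep the shared structure, and generalize every divergent region into a data-field placeholder.

```python
import difflib
import re

def tokenize(html):
    # Split the page into tag tokens and text tokens; drop pure whitespace.
    return [t for t in re.split(r'(<[^>]+>)', html) if t.strip()]

def infer_template(page_a, page_b):
    a, b = tokenize(page_a), tokenize(page_b)
    template = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op == 'equal':
            template.extend(a[i1:i2])    # shared tokens become fixed structure
        else:
            template.append('#PCDATA')   # divergent region becomes a data field
    return template

p1 = '<html><b>Title One</b><i>Author A</i></html>'
p2 = '<html><b>Title Two</b><i>Author B</i></html>'
print(infer_template(p1, p2))
# → ['<html>', '<b>', '#PCDATA', '</b>', '<i>', '#PCDATA', '</i>', '</html>']
```

A real system must additionally handle optional fields, nested lists of records, and mismatches in the tag structure itself, which is where the hard algorithmic work lies; this sketch only covers the flat, text-mismatch case.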
Clustering Web pages based on their structure
Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs
that extract data from HTML pages, and transform them into a more structured format, typically in XML.
These techniques automatically induce a wrapper from a set of sample pages that share a common HTML
template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper
inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically
discovering the main classes of pages offered by a site by exploring only a small yet representative portion
of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we
have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet
representative number of pages, and produces an accurate clustering of pages based on their structure. We
have developed a prototype, which has been used to perform experiments on real-life Web sites.
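One common way to realize structure-based clustering can be sketched as follows (hypothetical code, not the authors' prototype or model): describe each page by the set of root-to-leaf tag paths in its DOM, and group pages whose path sets exceed a Jaccard-similarity threshold.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collect the set of tag paths in a page, e.g. 'html/body/ul/li'."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add('/'.join(self.stack))
    def handle_endtag(self, tag):
        if tag in self.stack:            # pop back to the matching open tag
            del self.stack[self.stack.index(tag):]

def tag_paths(html):
    c = PathCollector()
    c.feed(html)
    return c.paths

def cluster(pages, threshold=0.6):
    clusters = []                        # each entry: (representative paths, members)
    for name, html in pages:
        paths = tag_paths(html)
        for rep, members in clusters:
            jaccard = len(paths & rep) / (len(paths | rep) or 1)
            if jaccard >= threshold:     # structurally similar: join this cluster
                members.append(name)
                break
        else:
            clusters.append((paths, [name]))
    return [members for _, members in clusters]

pages = [
    ('list1', '<html><body><ul><li>a</li><li>b</li></ul></body></html>'),
    ('list2', '<html><body><ul><li>x</li></ul></body></html>'),
    ('detail', '<html><body><table><tr><td>v</td></tr></table></body></html>'),
]
print(cluster(pages))   # → [['list1', 'list2'], ['detail']]
```

Note that repetition is absorbed automatically: the two `<li>` elements of `list1` contribute a single path, so list pages with different record counts still cluster together, which matches the paper's goal of grouping pages by template rather than by content.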